Skip to content

Improve listing performance of Hudi tables#17084

Merged
arunthirupathi merged 1 commit intoprestodb:masterfrom
codope:hudi-metadata-listing
Jan 4, 2022
Merged

Improve listing performance of Hudi tables#17084
arunthirupathi merged 1 commit intoprestodb:masterfrom
codope:hudi-metadata-listing

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Dec 9, 2021

  • Integrate metadata-based listing for Hudi tables. This is
    enabled by a session property.
  • Implement a new DirectoryLister for Hudi that uses
    HoodieTableFileSystemView to fetch data files.
  • Bump Hudi version to 0.10.0 to use above features.

Test plan -
Synced a Hudi table to Hive and queried through Presto
with both metadata enabled and disabled.

== NO RELEASE NOTE ==

Copy link
Copy Markdown
Collaborator

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. Looking in good shape!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should look at cleaning the path filter up altogether, if this just true always? Follow up PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I kept it to be compatible in case users disable and not use HudiDirectoryLister. But, actually we introduced the path filter in #13818 . Let's remove it in a follow up PR.

Comment thread pom.xml
<dep.druid.version>0.19.0</dep.druid.version>
<dep.jaxb.version>2.3.1</dep.jaxb.version>
<dep.hudi.version>0.9.0</dep.hudi.version>
<dep.hudi.version>0.10.0</dep.hudi.version>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be called out in the PR summary?

Comment thread presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java Outdated
private static boolean shouldUseFileSplitsForHudi(InputFormat<?, ?> inputFormat, Configuration conf, String tablePath)
private static boolean shouldUseFileSplitsForHudi(InputFormat<?, ?> inputFormat, Optional<HoodieTableMetaClient> metaClient)
{
if (inputFormat instanceof HoodieParquetRealtimeInputFormat) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is n't this check same as !isHudiParquetInputFormat(inputFormat) above? any chances to simplify?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HoodieParquetRealtimeInputFormat is subclass of HoodieParquetInputFormat. However, we want to use file splits from the former but not latter. Hence, a separate check.

String partition = FSUtils.getRelativePartitionPath(new Path(tablePath), directory);
if (fileStatuses.isPresent()) {
fileSystemView.addFilesToView(fileStatuses.get());
this.hoodieBaseFileIterator = fileSystemView.fetchLatestBaseFiles(partition).iterator();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just call the same API here? getLatestBaseFiles() and simplify? i think getLatestBaseFiles() calls fetchLatestBaseFiles

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getLatestBaseFiles() also calls ensurePartitionLoadedCorrectly() which adds files to fs view. Since, in this case we have already done that, so directly calling fetchLatestBaseFiles ().

@codope codope force-pushed the hudi-metadata-listing branch from 109ff66 to 6603580 Compare December 17, 2021 14:00
Copy link
Copy Markdown

@arunthirupathi arunthirupathi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this PR contain some unit test instead of manual verifiication ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method an override ? If not, why does it return optional, though the underlying value is always present.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, optional is not needed here. I will remove it.

@codope codope force-pushed the hudi-metadata-listing branch from 6603580 to b4049d4 Compare December 21, 2021 15:08
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Dec 21, 2021

Can this PR contain some unit test instead of manual verifiication ?

Hi @arunthirupathi thanks for reviewing the PR. I have added a test for HudiDirectoryLister. Can you take a look again?

Comment thread pom.xml Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is hbase dependency required ?

Comment thread presto-hive/pom.xml Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hbase is a pretty big dependency and this could cause issues.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hudi uses HBase in two ways:

  1. The metadata files are written as HFile.
  2. To maintain index of Hudi filegroup id to external bootstrap file.
    In this PR, we just need Bytes class from hbase-common for initializing HoodieTableFileSystemView and I have excluded deps that we don't need. We have created a thinner hudi-presto-bundle which I plan to update in presto-hive pom in a subsequent PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently Presto does not pull in other bigger open source jars (hadoop, hive) and the only way to do is to fork the code reduce it to only bare minimal code and then take a dependency on it.

https://github.com/prestodb/presto-hadoop-apache2
https://github.com/prestodb/presto-hive-apache

Hbase will cause dependency hell for Presto as some of the dependencies are conflicting. https://mvnrepository.com/artifact/org.apache.hbase/hbase-common/2.4.8

This will be a blocker to merge this PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even linked PR, shades everything which will avoid dependency issues but it will cause the jar size to explode.

What is the size of the shaded jar ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about 13 MB.
Unfotunately, HFile is only packaged as part of HBase. Would it be more preferable to use hudi-presto-bundle instead of adding hbase directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunthirupathi Can you please review this again? Check the latest commit where i'm using the stripped down hudi-presto-bundle.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For presto-hive can you please do a dependency tree diff of before and after and paste what changed ?

mvn dependency:tree -pl presto-hive (run this on master and the change branch)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below is the diff. The full dependency tree is in this gist.

sagars@Sagars-MacBook-Pro presto % git diff --no-index presto_hive_master_dep.txt presto_hive_hudi_dep.txt
diff --git a/presto_hive_master_dep.txt b/presto_hive_hudi_dep.txt
index 3f6e69f26a..21733b24f4 100644
--- a/presto_hive_master_dep.txt
+++ b/presto_hive_hudi_dep.txt
@@ -20,8 +20,7 @@
 [INFO] +- com.facebook.presto:presto-expressions:jar:0.268-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-cache:jar:0.268-SNAPSHOT:compile
 [INFO] |  \- org.alluxio:alluxio-shaded-client:jar:2.7.0:compile
-[INFO] +- org.apache.hudi:hudi-common:jar:0.9.0:compile
-[INFO] +- org.apache.hudi:hudi-hadoop-mr:jar:0.9.0:compile
+[INFO] +- org.apache.hudi:hudi-presto-bundle:jar:0.10.0:compile
 [INFO] +- com.facebook.presto:presto-memory-context:jar:0.268-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-hive-common:jar:0.268-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-hive-metastore:jar:0.268-SNAPSHOT:compile
@@ -249,6 +248,6 @@
 [INFO] ------------------------------------------------------------------------
 [INFO] BUILD SUCCESS
 [INFO] ------------------------------------------------------------------------
-[INFO] Total time:  2.809 s
-[INFO] Finished at: 2021-12-23T12:51:07+05:30
+[INFO] Total time:  2.857 s
+[INFO] Finished at: 2021-12-23T12:52:44+05:30
 [INFO] ------------------------------------------------------------------------

@codope codope force-pushed the hudi-metadata-listing branch from fa6e417 to a25fee9 Compare December 22, 2021 09:57
Copy link
Copy Markdown

@arunthirupathi arunthirupathi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one comment, but the rest looks good.

Comment thread presto-iceberg/pom.xml Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the exclusion required here and other directories ?

Copy link
Copy Markdown
Contributor Author

@codope codope Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without ignoring, the maven duplicate finder plugin complains about "Found duplicate (but equal) classes in [dependency A, dependency B]."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the prersto-hudi bundle contains copy of this class , but not shaded ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I just created a PR to shade them apache/hudi#4495

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will the new presto-hudi bundle available ? Can you use the new presto-hudi bundle instead of this one ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunthirupathi We plan a 0.10.1 minor release in Jan. Can we merge this and we can follow up after? Do you see any issues with this exclusion.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunthirupathi We plan a 0.10.1 minor release in Jan. Can we merge this and we can follow up after? Do you see any issues with this exclusion.

I don't see any, if you folks make effort to release this in Jan and remove the exclusion later. As usual, I will revert on first sign of trouble.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the test passes, I will merge this change.

@codope codope force-pushed the hudi-metadata-listing branch from a25fee9 to 7a65d9e Compare December 23, 2021 17:45
Copy link
Copy Markdown

@arunthirupathi arunthirupathi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are failing tests, Can you please rebase and squash the commits ?

Comment thread presto-iceberg/pom.xml Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the prersto-hudi bundle contains copy of this class , but not shaded ?

@codope codope force-pushed the hudi-metadata-listing branch 3 times, most recently from ff77c1d to fe74619 Compare January 3, 2022 07:50
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jan 3, 2022

There are failing tests, Can you please rebase and squash the commits ?

Done. But looks like there is one flaky test - TestDistributedQueryResource#testGetAllQueryInfoForLimits. It passes locally. Can we retry the failing check?

@arunthirupathi
Copy link
Copy Markdown

For the failing tests, rebase and push couple of times. If it fails after two attempts let me know, I can take a look and either disable or reach out to the author and fix it.

@codope codope force-pushed the hudi-metadata-listing branch from fe74619 to 716250a Compare January 4, 2022 09:36
- Integrate metadata-based listing for Hudi tables. This is
enabled by a session property.
- Implement a new DirectoryLister for Hudi that uses
HoodieTableFileSystemView to fetch data files.
- Bump Hudi version to 0.10.0 to use above features.

Add unit tests for HudiDirectoryLister

Replace hudi-common and hudi-hadoop-mr by hudi-presto-bundle
@codope codope force-pushed the hudi-metadata-listing branch from 716250a to 618aa95 Compare January 4, 2022 16:06
@arunthirupathi arunthirupathi merged commit 070c995 into prestodb:master Jan 4, 2022
@cocozianu
Copy link
Copy Markdown
Contributor

We've been running into trouble with his approach and in particular with hudi-presto-bundle 0.10.0
I created the issue: #17164
and the associated revert pull request
#17165

@arunthirupathi
Copy link
Copy Markdown

First, I apologize for the issue caused. I was concerned about the unshaded classes, but all the tests passed, so I thought this was not an issue.

Can you please share more details about the errors you are running into ? I will take a look and see if I can add more tests/errors to exercise them. Give as much details as possible, to prevent future regressions.

@cocozianu
Copy link
Copy Markdown
Contributor

Hi Sagar, Sumit.

Current conflicts, in facebook's installation of presto:

Found duplicate and different classes in [com.facebook.presto.hive:hive-apache:3.0.0-7, org.apache.hudi:hudi-presto-bundle:0.10.0]:
[WARNING] org.apache.parquet.avro.AvroCompatRecordMaterializer
[WARNING] org.apache.parquet.avro.AvroDataSupplier
[WARNING] org.apache.parquet.avro.AvroIndexedRecordConverter
[WARNING] org.apache.parquet.avro.AvroParquetInputFormat
[WARNING] org.apache.parquet.avro.AvroParquetOutputFormat
[WARNING] org.apache.parquet.avro.AvroParquetWriter
[WARNING] org.apache.parquet.avro.AvroReadSupport
[WARNING] org.apache.parquet.avro.AvroRecordConverter
[WARNING] org.apache.parquet.avro.AvroRecordMaterializer
[WARNING] org.apache.parquet.avro.AvroSchemaConverter
[WARNING] org.apache.parquet.avro.AvroWriteSupport
[WARNING] org.apache.parquet.avro.GenericDataSupplier
[WARNING] org.apache.parquet.avro.ParentValueContainer
[WARNING] org.apache.parquet.avro.ReflectDataSupplier
[WARNING] org.apache.parquet.avro.SpecificDataSupplier

@vinothchandar
Copy link
Copy Markdown
Collaborator

Hmmm. Not sure if we can shade parquet-avro, but we could skip bundling it with the hudi-presto-bundle. We will take a closer look at the hudi-presto-bundle, we have a 0.10.1 upcoming.

Apologies for your trouble!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants